Loading Data From An HTTP Server Tutorial

MLDB gives users full control over where and how data is persisted, and it handles multiple URL protocols (see Files and URLs). This tutorial shows how to load files via http:// or https:// from an HTTP server on the public internet or a private intranet.

For an example that uses file:// to load a file from inside an MLDB container, see the Loading Data Tutorial. MLDB also transparently supports loading files from Amazon S3 and SFTP servers; see the Files and URLs documentation for details.

The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.



In [2]:

    
from pymldb import Connection
mldb = Connection()

Loading data with http:// or https://

MLDB makes it very easy to load data from a public web server, since a file location can be specified using a remote URI. To illustrate this, we load a file from the Facebook Social Circles dataset, hosted by the Stanford Network Analysis Project (SNAP), which provides many public datasets.

We will simply import the file http://snap.stanford.edu/data/facebook_combined.txt.gz using the import.text procedure. Notice that not only is the file hosted on a remote server, but it is also compressed. MLDB will decompress it seamlessly as it's being downloaded.



In [3]:

    
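dataUrl = "http://snap.stanford.edu/data/facebook_combined.txt.gz"

# Create and run an import.text procedure that reads the remote, compressed file.
# print(...) with a single argument works under both Python 2 and Python 3.
print(mldb.put("/v1/procedures/import_data", {
    "type": "import.text",
    "params": {
        "dataFileUrl": dataUrl,
        "headers": ["node", "edge"],  # names for the two space-separated columns
        "delimiter": " ",
        "quoteChar": "",
        "outputDataset": "import_URL1",
        "runOnCreation": True
    }
}))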

<Response [201]>

We can now take a look:



In [5]:

    
mldb.query("SELECT * FROM import_URL1 LIMIT 5")
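
To make the "decompress while reading" idea concrete, here is a minimal sketch in plain Python of what reading a gzip-compressed, space-delimited file into named columns involves. It does not use MLDB and is not how import.text is implemented; the in-memory sample data and the read_edges helper are assumptions made up for illustration.

```python
import gzip
import io

# Hypothetical stand-in for the downloaded bytes: a tiny gzip-compressed,
# space-delimited edge list shaped like facebook_combined.txt.gz.
raw_text = "0 1\n0 2\n1 2\n"
compressed = io.BytesIO()
with gzip.GzipFile(fileobj=compressed, mode="wb") as gz:
    gz.write(raw_text.encode("utf-8"))
compressed.seek(0)


def read_edges(fileobj, headers=("node", "edge"), delimiter=" "):
    """Decompress a gzip stream line by line and map each row to the headers."""
    rows = []
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        for line in gz:
            values = line.decode("utf-8").rstrip("\n").split(delimiter)
            rows.append(dict(zip(headers, values)))
    return rows


edges = read_edges(compressed)
print(edges[0])
```

The values stay as strings here; any type conversion would be a separate step.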

Accessing a specific file inside an archive

If the target file is inside an archive (.tar or .zip), we can name the exact file to extract: prefix the archive's URL with archive+ and append # followed by the file's path inside the archive, as in the example below. Here, we load the 3980.circles file from the facebook folder:



In [4]:

    
dataUrl = "http://snap.stanford.edu/data/facebook.tar.gz"

# "archive+<url>#<path>" selects a single file inside the remote archive.
# print(...) with a single argument works under both Python 2 and Python 3.
print(mldb.put("/v1/procedures/import_data", {
    "type": "import.text",
    "params": {
        "dataFileUrl": "archive+" + dataUrl + "#facebook/3980.circles",
        "headers": ["circles"],
        "delimiter": " ",
        "quoteChar": "",
        "outputDataset": "import_URL2",
        "runOnCreation": True
    }
}))

<Response [201]>
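
If several files from the same archive need to be loaded, the URL pattern used in the cell above can be wrapped in a small helper. This is only a sketch: archive_member_url is a hypothetical function name, not part of pymldb or MLDB, and it simply reproduces the string concatenation shown above.

```python
def archive_member_url(archive_url, member_path):
    """Build an 'archive+<url>#<path>' string like the one in the cell above."""
    if "#" in archive_url:
        # Assumption: the archive URL itself should not already carry a fragment.
        raise ValueError("archive_url must not contain '#'")
    return "archive+" + archive_url + "#" + member_path


url = archive_member_url(
    "http://snap.stanford.edu/data/facebook.tar.gz",
    "facebook/3980.circles",
)
print(url)
```

The resulting string can then be passed as the dataFileUrl parameter of the procedure.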

Let's query our dataset to see what the data looks like:



In [5]:

    
mldb.query("SELECT * FROM import_URL2 LIMIT 5")

Out[5]:

_rowName  circles
1         circle0\t3989\t4009
2         circle1\t4010\t4037
3         circle2\t4013
4         circle3\t4024\t3987\t4015
5         circle4\t4006

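Each row holds a circle name followed by its member IDs, separated by tab characters (displayed as \t). Because we split on spaces, the whole line ends up in the single circles column. The next step would be to format the data so that it is easy to query. This is shown in the Executing JavaScript Code Directly in SQL Queries Using the jseval Function Tutorial, where we structure the data in a nicer way.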
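
As a rough illustration of that formatting step, the sketch below splits such rows in plain Python. It is not the jseval approach used in the linked tutorial, and it assumes that the \t shown in the query output stands for real tab characters; parse_circle and the sample list are hypothetical names introduced here.

```python
def parse_circle(line):
    """Split 'circleN<TAB>id<TAB>id...' into a circle name and integer member IDs."""
    fields = line.split("\t")
    return {"circle": fields[0], "members": [int(x) for x in fields[1:]]}


# Two rows copied from the query output above.
sample = ["circle0\t3989\t4009", "circle2\t4013"]
parsed = [parse_circle(row) for row in sample]
print(parsed)
```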

Conclusion

With support for multiple protocols, MLDB makes it easy to load data from local files, HTTP(S) servers, Amazon S3 and SFTP servers. As seen in this tutorial, you can even pinpoint the exact file to load within an archive's folder structure, allowing flexible data management.

Where to next?

Check out the other Tutorials and Demos.
